Fast and generic implementation using OpenMP and CUDA #45
Conversation
@shikishima-TasakiLab My compilation fails with the following error:

src/pytorch_wrapper.cpp:12:65: required from here
/usr/include/c++/8/bits/move.h:87:21: error: static assertion failed: template argument substituting _Tp is an lvalue reference type
    static_assert(!std::is_lvalue_reference<_Tp>::value, "template argument"
                  ^~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
@d-li14 Are you compiling in the environment where PyTorch is installed, using one of the following commands?

python3 setup.py build
python3 setup.py install
@shikishima-TasakiLab Yes, I am running it in the environment where PyTorch is installed.
@d-li14 In the following Docker environment, I was able to build. |
Hi, thank you very much for implementing this; it seems to work very well in full-precision mode. However, I get numerical-stability issues when using automatic mixed precision (AMP) training (the loss goes to NaN within a few steps). I am guessing that the CUDA implementation expects a full-precision input, but AMP gives it half precision. As a quick workaround, I patched _involution2d so I could at least use the rest of my network with mixed precision:

import torch
from typing import Optional, Tuple, Union
from torch.nn.modules.utils import _pair

def _involution2d(
    input: torch.Tensor,
    weight: torch.Tensor,
    kernel_size: Union[int, Tuple[int, int]] = 7,
    stride: Union[int, Tuple[int, int]] = 1,
    padding: Union[int, Tuple[int, int]] = 0,
    dilation: Union[int, Tuple[int, int]] = 1,
    groups: int = 1,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    kernel_size_ = _pair(kernel_size)
    stride_ = _pair(stride)
    padding_ = _pair(padding)
    dilation_ = _pair(dilation)
    # Workaround: under AMP the arguments arrive as float16, but the CUDA
    # kernel appears to need float32, so cast back before the call.
    if input.dtype == torch.half:
        input = input.float()
    if weight.dtype == torch.half:  # cast weight too, so the dtypes match
        weight = weight.float()
    # `ops` is the compiled extension module loaded by the package.
    output: torch.Tensor = ops.involution.involution2d(
        input, weight, kernel_size_, stride_, padding_, dilation_, groups
    )
    if bias is not None:
        output += bias.view(1, -1, 1, 1)
    return output
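An equivalent workaround, kept outside the patched function, is to disable autocast locally around the call. This is a minimal sketch; the _involution2d_amp_safe name is mine, not from the library:

def _involution2d_amp_safe(input, weight, **kwargs):
    # Leave the autocast region so the op runs entirely in float32,
    # regardless of what dtype AMP handed us.
    with torch.autocast('cuda', enabled=False):
        return _involution2d(input.float(), weight.float(), **kwargs)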
@shikishima-TasakiLab
at::Tensor involution2d_autocast(
    const torch::autograd::Variable& input,
    const torch::autograd::Variable& weight,
    const std::vector<int64_t>& kernel_size,
    const std::vector<int64_t>& stride,
    const std::vector<int64_t>& padding,
    const std::vector<int64_t>& dilation,
    const int64_t groups
) {
    c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
    auto exec_type = at::autocast::promote_type(at::kFloat, input, weight);
    return involution2d_autograd(
        at::autocast::cached_cast(exec_type, input),
        at::autocast::cached_cast(exec_type, weight),
        kernel_size, stride, padding, dilation, groups
    );
}
@csvance
I fixed the CUDA implementation to take full-precision input under AMP by adding an autocast wrapper.
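For reference, the wrapper follows autocast's usual promotion policy: promote to the widest floating dtype among the arguments, floored at float32. In Python terms, this is roughly (a minimal sketch, not the extension's actual code):

import torch

def autocast_promote(input: torch.Tensor, weight: torch.Tensor):
    # promote_type(at::kFloat, input, weight): widest dtype of the two
    # tensors, but never narrower than float32.
    exec_type = torch.promote_types(
        torch.promote_types(input.dtype, weight.dtype), torch.float32
    )
    # cached_cast casts each argument to exec_type (autocast also caches
    # the cast result; the caching is omitted here).
    return input.to(exec_type), weight.to(exec_type)

So with float16 input and weight, both tensors are cast up to float32 before the autograd implementation runs.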
include/involution2d_cuda.cuh
-#define CUDA_MAX_THREADS 512u
+#define CUDA_MAX_THREADS 1024u
@d-li14
In your CuPy implementation, the maximum number of CUDA threads was set to 1024. However, in my experiments my CuPy reimplementation did not work with 1024, so I set it to 512.
My CUDA implementation does work with 1024; I had set it to 512 while experimenting and forgot to change it back.
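This constant only changes the kernel launch configuration: with n output elements and CUDA_MAX_THREADS threads per block, the grid size is the ceiling of n / CUDA_MAX_THREADS. A sketch of the usual arithmetic (not the PR's exact code):

def cuda_grid_size(n: int, max_threads: int = 1024) -> int:
    # Ceiling division: enough blocks of max_threads threads each to
    # cover all n elements; doubling max_threads halves the block count.
    return (n + max_threads - 1) // max_threads

For example, cuda_grid_size(10000, 512) is 20 while cuda_grid_size(10000, 1024) is 10; the total thread count is nearly identical either way, which is consistent with the similar timings reported below.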
@shikishima-TasakiLab
Thanks, but I have tried changing the maximum number of CUDA threads, and the results are still similar.
@@ -27,7 +35,7 @@
     ],
     extra_compile_args={
         'cxx': EXTRA_COMPILE_ARGS,
-        'nvcc': ['-O3'],
+        'nvcc': ['-O3'] + GENERATE_CODES,
     }
 )
 )
@d-li14
I changed the NVCC arguments so that the code is generated and optimized for several GPU architectures. Hopefully this will increase the speed.
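GENERATE_CODES presumably expands to a list of nvcc -gencode flags; a hypothetical reconstruction (the exact architecture list in the PR may differ):

# Hypothetical: pre-compile device code (SASS) for common architectures
# so the extension does not rely on JIT compilation at load time.
GENERATE_CODES = [
    '-gencode=arch=compute_61,code=sm_61',  # Pascal (GTX 10xx)
    '-gencode=arch=compute_70,code=sm_70',  # Volta (V100)
    '-gencode=arch=compute_75,code=sm_75',  # Turing (RTX 20xx)
    '-gencode=arch=compute_86,code=sm_86',  # Ampere (RTX 30xx)
]

Matching the binary to the actual GPU mainly avoids JIT overhead at load time and enables architecture-specific instruction selection; it usually does not change the kernel's algorithmic cost, which may explain why no large speedup was observed.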
@shikishima-TasakiLab
Thanks for your efforts. However, the new code still does not produce the expected speedup on my side.
close #44